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Abstract 

Numerous variable selection methods rely on a two-stage procedure, where a 
sparsity-inducing penalty is used in the first stage to predict the support, which 
is then conveyed to the second stage for estimation or inference purposes. In this 
framework, the first stage screens variables to find a set of possibly relevant variables 
and the second stage operates on this set of candidate variables, to improve estima¬ 
tion accuracy or to assess the uncertainty associated to the selection of variables. We 
advocate that more information can be conveyed from the first stage to the second 
one: we use the magnitude of the coefficients estimated in the first stage to define an 
adaptive penalty that is applied at the second stage. We give two examples of proce¬ 
dures that can benefit from the proposed transfer of information, in estimation and 
inference problems respectively. Extensive simulations demonstrate that this trans¬ 
fer is particularly efficient when each stage operates on distinct subsamples. This 
separation plays a crucial role for the computation of calibrated p-values, allowing 
to control the False Discovery Rate. In this setup, the proposed transfer results in 
sensitivity gains ranging from 50% to 100% compared to state-of-the-art. 

Keywords: Linear model, Lasso, Variable selection, p-values, False discovery rate, Screen 
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1 Introduction 


The selection of explanatory variables has attracted mnch attention these last two decades, 
particnlarly for high-dimensional data, where the nnmber of variables is greater than the 
nnmber of observations. This type of problem arises in a variety of domains, inclnding image 


analysis (Wang et ah 2008), chemometry (Chong and Jnn 2005) and genomics (Xing et al. 


2001, Ambroise and McLachlan 2002, Anders and Hnber 2010). Since the development 


of the sparse estimators derived from penalties snch as the Lasso (Tibshirani 1996) or 


the Dantzig selector (Candes and Tao 2007), sparse models have been shown to be able 


to recover the snbset of relevant variables in varions sitnations (see, e.g. Candes and Tao 


2007, 

Verzelen 

2012 

iBiihlmann 

2013, 


However, the conditions for snpport recovery are quite stringent and difficult to assess 
in practice. Furthermore, the strength of the penalty to be applied differs between the 
problem of model selection, targeting the recovery of the support of regression coefficients, 
and the problem of estimation, targeting the accuracy of these coefficients. As a result, 
numerous variable selection methods rely on a two-stage procedure, where the Lasso is used 
in the hrst stage to predict the support, which is then conveyed to the second stage for 
estimation or inference purposes. In this framework, the first stage screens variables to find 
a set of possibly relevant variables and the second stage operates on this set of candidate 
variables, to improve estimation accuracy or to assess the uncertainty associated to the 
selection of variables. 

This strategy has been proposed to correct for the estimation bias of the Lasso coef- 
hcients, with several variants in the second stage. The latter may then be performed by 


ordinary least squares (OLS) regression for the LARS/OLS Hybrid of Efron et al. (2004) 


(see also Belloni and Chernozhukov 2013), by the Lasso for the Relaxed Lasso of Mein- 


shausen (2007), by modified least squares or ridge regression for Liu and Yu (2013), or with 


‘any reasonable regression method” for the marginal bridge of Huang et al. (2008). 

The same strategy has been proposed to perform variable selection with statistical 


guarantees by Wasserman and Roeder (2009), whose approach was pursued by Meinshausen 


et al. (2009). The first stage performs variable selection by Lasso or other regression 


methods on a subset of data. It is followed by a second stage relying on the OLS, on the 
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remaining subset of data, to test the relevance of these selected variables. 0 

To summarize, the hrst stage of these approaches screens variables and transfers the 
estimated support of variables to the second stage for a more focused in-depth analysis. 
In this paper, we advocate that more information can be conveyed from the hrst stage 
to the second one, by using the magnitude of the coefficients estimated in the hrst stage. 
Improving this information transfer is essential in the so-called the large p small n de¬ 
signs which are typical in genomic applications. The magnitude of regression coefficients 
is routinely interpreted as a quantitative gauge of relevance in statistical analysis, can be 
used to dehne an adaptive penalty, following alternative views of sparsity-inducing penal¬ 
ties. These views may originate from variational methods regarding optimization, or from 
hierarchical Bayesian models, as detailed in Section In Sections and we give two 
examples of procedures that can beneht from the proposed transfer of magnitude in estima¬ 
tion and inference problems respectively. The actual benehts are empirically demonstrated 
in Section m 


2 Beyond Support: Magnitude 


We consider the following high-dimensional sparse linear regression model; 


y = X/3* + s , 


where y = (|/i, ■ ■ ■ , UnY is the vector of responses, X is the nxp design matrix with p ^ n, 
(3* is the sparse p-dimensional vector of unknown parameters, and e is a n-dimensional 
vector of independent random variables of mean zero and variance 

We discuss here two-stage approaches relying on a hrst screening of variables based on 
the Lasso, which is nowadays widely used to tackle simultaneously variable estimation and 
selection. The original Lasso estimator is dehned as: 


/3(A) = argmin J(/3) -h A ||/3||^ 

peMP 


( 1 ) 


^In their two-stage procedure, Liu and Yu (2013) also proposed to construct confidence regions and 
to conduct hypothesis testing by bootstrapping residuals. Their approach fundamentally differs from 


Wasserman and Roeder (2009), in that inference does not rely on the two-stage procedure itself, but on 


the properties of the estimator obtained in the second stage. 

^Though many sparsity-inducing penalties, such as the Elastic-Net, the group-Lasso or the fused-Lasso 
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where A is a hyper-parameter, and J(/3) is the data-fitting term. Throughout this paper, 
we will discuss regression problems for which J(/3) is dehned as 

J(/3) = t IIX/J - y||^ , 

but, except for the numerical acceleration tricks mentioned in Appendix the overall 
feature selection process may be applied to any other form of J(/3), thus allowing to address 
classihcation problems. 

Our approach relies on an alternative view of the Lasso, seen as an adaptive -^2 penal¬ 
ization scheme, following a viewpoint that has been mostly taken for optimization purposes 


(Grandvalet 1998, Grandvalet and Gann 1999, Bach et al. 2012). It may be formalized as 
a variational form of the Lasso: 


/3eKP 


min J(/3)-I-A ^ —/3 

RP,TeIRP Tj ■ 


J=l 1 

p 


( 2 ) 


s- t- ° 

i=i i=i 


A > 0 , j = l,...,p 


The variable r introduced in this formulation, which adapts the £2 penalty to the data, 
can be shown to lead to the following adaptive-ridge penalty: 




A 




(3) 


.tT 1/5,(a)| 

where the coefficients /5,(A) are the solution to the Lasso problem Q. 

Using this adaptive -£2 penalty returns the original Lasso estimator (see proof in Ap¬ 
pendix]^. This equivalence is instrumental here for dehning the data-dependent penalty 
(|^, implicitly determined in the hrst stage through the Lasso estimate, that will also be 
applied in the second stage. In this process, our primary aim is to retain the magnitude 
of the coefficients of /3(A) in addition to the support Sx = {j G {1, ...,p}|/5,(A) 7 ^ 0}: the 
coefficients estimated to be small in the hrst stage will thus also be encouraged to be also 
small in the second stage, whereas the largest ones will be less penalized. 

The variational form of the Lasso can be interpreted as a hierarchical model in the 


Bayesian framework (Grandvalet and Gann 1999). In this interpretation, together with A 


lend themselves to the approach proposed here, we will stick to the simple Lasso penalty throughout the 
paper. 
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and the noise variance, the Tj parameters of Problem Q dehne the diagonal covariance 
matrix of a centered Ganssian prior on (3 (assnming a Ganssian noise model on y). Hence, 
“freezing” the Tj parameters at the hrst stage of a two-stage approach can be interpreted 
as picking the parameters of the Ganssian prior on /3 to be nsed at the second stage. 


3 A Two-Stage Estimation Procedure: Lasso-j-Ridge 


In sparse linear regression models, several theoretical resnlts state conditions that ensnre 
asymptotical snpport recovery, that is, the recovery of the snbset of all relevant explanatory 
variables. One of the main resnlt reveals a necessary and snfficient condition for the selec¬ 
tion property of f'l-regnlarized least sqnares. Several variants of this condition have been 
proposed, snch as the irrepresentable condition, the restricted eigenvalne condition, or the 
mntnal incoherence condition. In a nntshell, this type of condition states that the snbset 
of trnly effective variables can be retrieved exactly, provided the relevant and irrelevant 
covariates are not too strongly correlated. However, the rate of convergence of the Lasso 
may be slow and many noise variables are selected if the estimator is chosen by a predictive 


criterion snch as cross-validation (Meinshansen 2007). These observations motivated the 


proposal of several two-stage procednres (Efron et al.||2004 Meinshansen||2007 Hnang et al 


2008, Belloni and Ghernozhnkov||2013| Lin and Yn||2013 ). They produce models with faster 
convergence, smaller bias, and even, under more restrictive assumptions, oracle guarantees. 

In this paper, we experimentally investigate the large p small n designs for the Lasso-|-OLS 
(Efron et al. 2004, Belloni and Ghernozhukov 2013) and Lasso-I-Ridge ( |Liu and Yu 2013) 
procedures, comparing them to a variant based on adaptive ridge. We do not work out the 


proofs of Liu and Yu (2013) to show the consistency of the adaptive ridge variant, since we 


believe that this transposition would be of low interest. 


3.1 Original Lasso+OLS and Lasso+Ridge Procedures 

In these two-stage procedures, the support Sx of the sparse Lasso estimator /3(A) of Equa¬ 
tion ([^ dehnes the set of possibly relevant variables. Then, either ordinary least squares 
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or ridge regression is applied to the selected predictors: 


/3(A,/i)= argmin J(/3) + p ||/3||2 , 

/3eKP:/3,=0j^5A 


where we have the Lasso+OLS for /i = 0. 


Belloni and Chernozhukov (2013) and Liu and Yu (2013) work out the rates that should 


govern the decay of the Lasso penalty parameter A for Lasso+OLS and Lasso+Ridge re¬ 
spectively, but they do not propose a practical means of setting the constants so as to dehne 


the actual estimator. In their experimental section, Liu and Yu (2013) however compute 


A by cross-validation, while the ridge parameter /r is set to 1/n, thereby following the rate 
decay that theoretically enjoys good estimation and prediction performances. 


3.2 Lasso+Adaptive Ridge Procedure 

In practice, the actual choice of the penalization parameters A and /i is very important 
regarding performances. Cross-validation is commonly used to estimate the penalty pa¬ 


rameter A of the Lasso estimator, and we follow Liu and Yu (2013) in using this scheme 


for setting A for Lasso+OLS, Lasso+Ridge and Lasso+Adaptive Ridge, dehned as: 




^ ^ A 

3(A,/r)= argmin ,,, 

f3eRP:^j=o,j^Sx |Pi(A)| 

where /3j(A) are the regression coefficients obtained by the Lasso with penalty parameter A. 
Then, as setting arbitrarily /i = 1/n can lead to very bad performances for Lasso+Ridge 
or Lasso+Adaptive Ridge, we also chose to set /i by cross-validation. 

Note that, if applied naively, this serial selection process is prone to overhtting, in the 
sense that the variables selected by the Lasso are likely to be correlated with the response 
variable, resulting in optimistic conclusions regarding variable importance, a phenomenon 


known as Freedman’s paradox in model selection (see Freedman 1983). Our protocol con¬ 
sists in cross-validating the complete serial process to select p once A has been chosen in 
the screening stage of the procedure (that is, A is fixed, but Sx is recomputed at each fold 


of the cross-validation process). Finally, following Meinshausen (2007), we set jointly A and 
/i by cross-validation, so that the A parameter of the Lasso screening is not optimized so as 
to minimize the expected prediction error of the Lasso estimator itself, but it is optimized 
so as to optimize this error for the Lasso+Adaptive Ridge estimator. 
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4 A Two-Stage Inference Procedure: Screen and Clean 


When interpretability is a key issue, it is essential to take into account the uncertainty as¬ 
sociated to the selection of variables inferred from limited data. Indeed, this assessment is 
critical before investigating possible effects, since there is no way to ascertain that the sup¬ 
port is identihable. Indeed, in practice, the irrepresentable condition and related conditions 
cannot be checked (Biihlmann 2013| ). 

A classical way to assess the predictor uncertainty consists in testing the signihcance 
of each predictor by statistical hypothesis testing and the derived p-values. Although p- 
values have a number of disadvantages and are prone to possible misinterpretations, it is 
the numerical indicator that most end-users rely upon when selecting predictors in high¬ 
dimensional context. Well-established and routinely used selection methods in genomics 


are univariate (Balding 2006). Although more powerful, multivariate approaches suffer 
from instability and lack of usual measure of uncertainty. It is only recently that means for 
computing p-values or conhdence intervals in the high-dimensional regression setup were 


proposed, originating with the work of Wasserman and Boeder (2009) and followed by 


others (Meinshausen et ah 2009, Biihlmann 2013, Liu and Yu 2013). From a practical 
point of view, these recent developments are essential for convincing practitioners of the 


benehts of multivariate sparse regression models (Boulesteix and Schmid 2014). Here, we 


build on the seminal work of Wasserman and Boeder (2009). We propose to introduce 
adaptive ridge in the cleaning stage to transfer more information from the screening stage 
to the cleaning stage, and thus to make a more extensive use of the subsample of the 
original data reserved for screening purposes. 


4.1 Original Screen and Clean Procedure 

The procedure considers a series of sparse models {JAIasA) indexed by a parameter A G A, 
which may represent a penalty parameter for regularization methods or a size constraint for 
subset selection methods. The screening stage consists of two steps. In the hrst step, each 
model JA is htted to (part of) the data, thereby selecting a set of possibly relevant variables, 
that is, the support of the model Sx- Then, in the second step, a model selection procedure 
chooses a single model with its associated Sy Next, the cleaning stage eliminates 
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possibly irrelevant variables from S^, resulting in the set S that provably controls the 
type one error rate. The original procedure relies on three independent subsamples of the 
original data P = "Di U'D 2 U so as to ensure the consistency of the overall process. The 
following chart summarizes this procedure, showing the actual use of data that is made at 
each step: 


p} 


screening stage 


cleaning stage 


step I {‘Di) 
fit model 


>{5a} 


step II (X^i,X^ 2 ) 


AgA 


select model 


> 5 . 


step III 


test support 


>5 


Under suitable conditions, the screen and clean procedure performs consistent variable 
selection, that is, it asymptotically recovers the true support with probability one. The two 
main assumptions are that the screening stage should asymptotically avoid false negatives, 
and that the size of the true support should be constant, while the number of candidate 
variables is allowed to grow logarithmically in the number of examples. These assumptions 


are brought back by Meinshausen et ah (2009) in more rigorous terms as the “screening 
property” and “sparsity property”. 

Empirically, Wasserman and Roeder ( |2009 ) tested the procedure with the Lasso, uni¬ 
variate testing, and forward stepwise regression at step I of the screening stage. At step 
II, model selection was always based on ordinary least squares (OLS) regression. The OLS 
parameters were adjusted on the “training” subsample Vi, using the variables in {iSaIasa, 
and model selection consisted in minimizing the empirical error on the “validation” sub¬ 
sample V 2 with respect to A. Cleaning was then hnally performed by testing the nullity of 


the OLS coefficients using the independent “test” subsample V^. Wasserman and Roeder 


( 2009[ ) conclude that the variants using multivariate regression (Lasso and forward step¬ 
wise) have similar performances, way above univariate testing. 

We now introduce the improvements that we propose here at each stage of the process. 
Our methodological contribution lies at the cleaning stage, but we also introduced minor 
modifications at the screening stage that have considerable practical outcomes. 


4.2 Adaptive-Ridge Cleaning Stage 


The original cleaning stage of Wasserman and Roeder (2009) is based on the ordinary least 
square (OLS) estimate. This choice is amenable to efficient exact testing procedure for 
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selecting the relevant variables, where the false discovery rate can be provably controlled. 
However, this advantage conies at a high price: 

• hrst, the procedure can only be used if the OLS is applicable, which requires that the 
number of variables that passed the screening stage is smaller than the number 
of examples \'D^\ reserved for the cleaning stage; 

• second, the only information retained from the screening stage is the support 
itself. There are no other statistics about the estimated regression coefficients that 
are transferred to this stage. 

We propose to make a more effective use of the data reserved for the screening stage by 
following the approach described in Section the magnitude of the regression coefficients 
/3(A) obtained at the screening stage is transferred to the cleaning stage via the adaptive- 
ridge penalty term. Adaptive refers here to the adaptation of the penalty term to the data 
at hand. The penalty metric is adjusted to the “training” subsample "Di, its strength is set 
thanks to the “validation” subsample and cleaning is eventually performed by testing 
the nullity of the adaptive-ridge coefficients using the independent “test” subsample "Da. 

The statistics computed from our penalized cleaning stage improve the power of the 
procedure: we observe a dramatic increase in sensitivity (that is, in true positives) at any 
false discovery rate (see Figure]^ of the numerical experiment section). With this improved 
accuracy also comes more precision: the penalization at the cleaning stage brings the 
additional beneht of stabilizing the selection procedure, with less variability in sensitivity 
and false discovery rate. Furthermore, our procedure allows for a cleaning stage remaining 
in the high-dimensional setup (that is, 

However, using penalized estimators raises a difficulty for the calibration of the statisti¬ 
cal tests derived from these statistics. We resolved this issue through the use of permutation 
tests. 

4.3 Testing the Significance of the Adaptive-Ridge Coefficients 

Student’s t-test and Fisher’s F-test are two standard ways of testing the nullity of the 
OLS coefficients. However, these tests do not apply to ridge regression, for which no exact 
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procedure exists. 


Halawa and El Bassiouni (1999) proposed a non-exact t-test, but it can be severely off 


when the explanatory variables are strongly correlated. For example, Cule et ah (2011) 
report a false positive rate as high as 32% for a significance level supposedly fixed at 5%. 
Typically, the inaccuracy aggravates with high penalty parameters, due to the bias of the 
ridge regression estimate, and due to the dependency between the response variable and 
the ridge regression residuals. 

The F-test compares the goodness-of-£t of two nested models. Let yi and yo be the 
n-dimensional vectors of predictions for the larger and smaller model respectively. The 
F-statistic 

P ^ l|y -yoll - lly -yill 


iiy-yii 

follows a Fisher distribution when yi and yo are estimated by ordinary least squares under 
the null hypothesis that the smaller model is correct. Although it is widely used for 
model selection in penalized regression problems (for calibration and degrees of freedom 
issues, see 


Hastie and Tibshirani 1990), the F-test is not exact for ridge regression, for the 


reasons already mentioned above - estimation bias and dependency between the numerator 
and the denominator in Equation (|^. Here, we propose to approach the distribution 
of the F-statistic under the null hypothesis by randomization. We permute the values 
taken by the explicative variable to be tested, on the larger model, so as to estimate the 
distribution of the F-statistic under the null hypothesis that the variable is irrelevant. This 
permutation test is asymptotically exact when the tested variable is independent from the 
other explicative variables, and is approximate in the general case. It can be efficiently 
implemented using block-wise decompositions, thereby saving a factor p, as detailed in 
Appendix [Bj 

Table shows that, compared to the standard t-test and F-test (see e.g. Hastie and 


Tibshiranl||1990), the permutation test provides a satisfactory control of the significance 


level. It is either well-calibrated or slightly more conservative than the prescribed signih- 
cance level, whereas the standard t-test and F-test result in false positive rates that are way 
above the asserted significance level, especially for strong correlations between explanatory 
variables. These observations apply throughout the experiments reported in Section 
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Table 1: Expected false positive rate FPR (or type-I error) and sensitivity SEN (or power) 
computed over 500 simulations and over the variables selected in the screening stage. The 
prescribed significance level is 5%. The IND, BLOCK, GROUP and TOEP“ designs are 
fully described in Section 


Simulation design 

IND 

BLOCK 

GROUP 

TOEP- 

FPR 

SEN 

FPR 

SEN 

FPR 

SEN 

FPR 

SEN 

permutation F-test 

5.1 

92.4 

3.9 

86.7 

3.9 

62.3 

4.7 

81.9 

standard F-test 

9.9 

93.1 

11.8 

89.6 

14.8 

73.0 

15.4 

87.1 

standard t-test 

8.0 

94.0 

12.4 

93.1 

5.8 

95.7 

7.9 

85.1 


Testing all variables results in a multiple testing problem. We propose here to control the 
false discovery rate (FDR), which is dehned as the expected proportion of false discoveries 
among all discoveries. This control requires to correct the p-values for multiple testing 


(Benjamini and Hochberg 1995). The overall procedure is well calibrated as shown in 


Section |5l 


4.4 Modifications at Screening Stage 


Wasserman and Boeder (2009) propose to use two subsamples at the cleaning stage in order 


to establish the consistency of the screen and clean procedure. Indeed, this consistency 
relies partly on the fact that all relevant variables pass the screening stage with very 


high probability. This “screening property” (Meinshausen et ah 2009) was established 


using the protocol described in Section 4.1 To our knowledge, it remains to be proved 


for model selection based on cross-validation. However, Wasserman and Boeder (2009) 


mention another procedure relying on two independent subsamples of the original data 
D = Di U P 2 , where model selection relies on leave-one-out cross-validation on Vi and T >2 
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is reserved for cleaning. The following chart snmmarizes this modified procednre: 


screening stage 


cleaning stage 


, , stepI(Di) , , stepII(X>i) step III (X>2) 

|1,... ,p| —-^ S 


fit model 


select model ^ test support 


Hence, half of the data are now devoted to each stage of the method. We followed here this 
variant, which resnlts in important sensitivity gains for the overall selection procednre. 


We slightly depart from (Wasserman and Roeder 2009), by selecting the model by 10- 
fold cross-validation, and, more importantly, by nsing the snm of sqnares residnals of the 


penalized estimator for model selection. Note that Wasserman and Roeder (2009), and later 


Meinshansen et al. (2009) based model selection on the OLS estimate nsing the snpport 


S\. This choice implicitly limits the size of the selected snpport < |, which is actnally 


implemented more stringently as and ^ by Wasserman and Roeder (2009) 


and Meinshansen et al. (2009) respectively. Onr model selection criterion allows for more 
variables to be transferred to the cleaning stage, so that the screening property is more 
likely to hold. 


5 Numerical Experiments 

Variable selection algorithms are difficnlt to assess objectively on real data, where the trnly 
relevant variables are nnknown. Simnlated data provide a direct access to the gronnd trnth, 
in a situation where the statistical hypotheses hold. 

5.1 Simulation Models 

We consider the linear regression model Y = Xf3* + e, where V is a continuous response 
variable, X = (Xi,..., Xp) is a vector of p predictor variables, f3* is the vector of unknown 
parameters and £ is a zero-mean Gaussian error variable with variance cr^. The parameter 
/3* is sparse, that is, the support set S* = [j G {1, ...,p}|/3* ^ O} indexing its non-zero 
coefficients is small |iS*| p. 

Variable selection is known to be affected by numerous factors: the number of examples 
n, the number of variables p, the sparseness of the model |5*|, the correlation structure of 
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the explicative variables, the relative magnitude of the relevant parameters and 

the signal-to-noise ratio SNR. 

In our experiments, we varied n G {250,500}, p G {250,500}, |5*| G {25,50}, p G 
{0.5,0.8}. We considered four predictor correlation structures: 

IND independent explicative variables following a zero-mean, unit-variance Gaussian 
distribution: X ~A/’(0,I); 

BLOCK dependent explicative variables following a zero-mean Gaussian distribution, with 
a block-diagonal covariance matrix: X ~ A/'(0, S), where T^a = 1, Sjj = p for 
all pairs (f, j), j ^ i belonging to the same block and Sjj = 0 for all pairs (i, j) 
belonging to different blocks. Each block comprises 25 variables. 

The position of relevant variables is dissociated from the block structure, that 
is, randomly distributed in {l,...,p}. This design is thus difficult for variable 
selection. 

GROUP same as BLOCK, except that the relevant variables are gathered a single block 
when |iS*| = 25 and in two blocks when |iS*| = 50, thus facilitating group variable 
selection. 

TOEP“ same as GROUP, except that for all pairs (f, j), j ^ i belonging to 

the same block and Sjj = 0 for all pairs (f,j) belonging to different blocks. 

This design allows for strong negative correlations. 

The non-zero parameters /d* are drawn from a uniform distribution L/(10“^,l) to enable 
different magnitudes. Finally, the signal to noise ratio, dehned as SNR = /a"^ 

varies in {4, 8, 32}. 

5.2 Two-St age Estimation 

In the following, we discuss the IND BLOCK, GROUP and TOEP“ designs with n = 250, 
p = 500, |iS*| = 50 and p = 0.5. We report results with three different noise levels. The 
relative behavior of the estimation methods is similar for high and medium noise levels 


13 



(respectively SNR = 4 and SNR = 8), with more significant differences for medinm noise 
levels. The sitnation then drastically changes for the low noise level (SNR = 32). 

We compare the variants of the two-stage estimation methods based on the predictive 
mean sqnared error. Similar conclusions would be drawn from the accuracy measures on 
the vector of coefficients (3*. Figure [T] displays the boxplots of prediction error obtained 
over 500 simulations for each design. 

There is no benefit in a post-Lasso estimation step for high an medium noise levels 
(SNR G {4, 8}). OLS and ridge post-processing then have important detrimental effects and 
adaptive ridge has still a slight unfavorable effect. It is only when the two-step procedure 
is jointly optimized with respect to the two penalization parameters (by cross-validation), 
that some improvements become visible for the first three setups. 

When the signal-to-noise ratio is high (SNR = 32), Lasso highly benefits from the second 
stage whatever it may be (OLS, ridge or adaptive-ridge). There is a slight edge to adaptive 
ridge when variables are independent, but otherwise all methods are at par. Globally, the 
best option here consists again in jointly optimizing the two stages with respect to the two 
penalization parameters; some additional improvements come into view. 

Compared to previous studies, which mainly focused on large sample and/or low-noise 
settings, our experiments demonstrate that post-Lasso estimation can have consequential 
beneficial or detrimental effects in small sample regression. In addition to the experimental 
design, the results vary also considerably according to the strategy governing the choice of 
the penalty parameters. Other experiments (not shown here) attest that using more strin¬ 


gent screening stages (using the so-called “1-SE rule” of Breiman et ah 1984, that chooses 
the highest penalty within one standard deviation of the minimum of cross-validation) lead 
to better post-Lasso estimation in some experimental setups, but this is not systematic: in 
the TOEP“ design, this is by far the least favorable option. Overall, the joint optimization 
with respect to the two penalization parameters seems to be a very challenging contender. 
This is also true when the Lasso screening is followed by OLS or ridge regression. The 
joint optimization of penalization parameter favors a stringent Lasso screening compared 
to the strategy based on serial cross-validation, and a less stringent one compared to the 
1-SE rule. Though this solution is the most expensive from the computational viewpoint. 
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it seems to be also the most effective one regarding predictive mean squared error. 


5.3 Two-St age Inference 


In the following, we discuss the IND BLOCK, GROUP and TOEP“ designs with n = 250, 
p = 500, |iS*| = 25, p = 0.5 and SNR = 4, since the relative behavior of all methods is 
representative of the general pattern that we observed across all simulation settings. These 
setups lead to feasible but challenging problems for model selection. 

All variants of the screen and clean procedure are evaluated here with respect to their 
sensitivity (SEN), for a controlled false discovery rate FDR. These two measures are the 
analogs of power and significance in the single hypothesis testing framework: 


SEN = E 


TP 




TP+ FN 


{{TP+FN)>0} 


FDR = E 


FP 




TP+ FP 


{{TP+FP)>0} 


where FP is the number of false positives, TP is the number of true positives and FN is 
the number of false negatives. 

We first show the importance of the cleaning stage for FDR control. We then show 


the benefits of our proposal compared to the original procedure of Wasserman and Roeder 


(2009) and to the univariate approach. The variable selection method of Lockhart et al. 


(2014) was not included in these experiments, because it did not produce convincing results 


in these small n large p designs where the noise variance is not assumed to be known. 


Importance of the Cleaning Stage Table shows that the cleaning step is essential 
to control the FDR at the desired level. In the screening stage, the variables selected by 
the Lasso are way too numerous: first, the penalty parameter is determined to optimize the 
cross-validated mean squared error, which is not optimal for model selection; second, we 
are far from the asymptotic regime where support recovery can be achieved. As a result, 
the Lasso performs rather poorly. Cleaning enables the control of the FDR, leading of 
course to a decrease in sensitivity, which is moderate for independent variables, and higher 
in the presence of correlations. 


Comparisons of Controlled Selection Procedures Figure [^provides a global picture 
of sensitivity according to FDR, for the test statistics computed in the cleaning stage. First, 
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Table 2: False discovery rate FDR and sensitivity SEN, compnted over 500 simulations for 
each design. The screening stage (before cleaning) is not calibrated; the cleaning stage is 
calibrated to control the FDR below 5%, using the Benjamini-Hochberg procedure. Our 
adaptive-ridge (AR) cleaning is compared with the original (OLS) cleaning and univariate 
testing (Univar). 


Simulation design 

IND 

BLOCK 

GROUP 

TOEP- 

FDR 

SEN 

FDR 

SEN 

FDR 

SEN 

FDR 

SEN 

Before cleaning 

76.7 

87.5 

76.0 

83.9 

38.9 

86.2 

79.9 

56.5 

AR cleaning 

4.2 

76.1 

2.8 

64.8 

1.7 

37.7 

4.3 

39.6 

OLS cleaning 

3.9 

48.3 

3.1 

37.1 

2.5 

17.9 

3.7 

25.3 

Univar 

4.4 

40.4 

86.4 

71.0 

5.3 

100.0 

4.2 

28.4 


we observe that the direct univariate approach, which simply considers a t-statistic for each 
variable independently, is by far the worst option in the IND, BLOCK and TOEP“ designs, 
and by far the best in the GROUP design. In this last situation, the univariate approach 
conhdently detects all the correlated variables of the relevant group, while the regression- 
based approaches are hindered by the high level of correlation between variables. Betting 
on the univariate approach may thus be prohtable, but it is also risky due to its extremely 
erratic behavior. Second, we see that our adaptive-ridge cleaning systematically dominates 
the original OLS cleaning. To isolate the effect of transfering the magnitude of weights from 
the effect of the regularization brought by adaptive-ridge, we show the results obtained from 
a cleaning step based on plain ridge regression (with regularization parameter set by cross- 
validation). We see that ridge regression cleaning improves upon OLS cleaning, but that 
adaptive-ridge cleaning brings this improvement much further, thus conhrming the value 
of the weight transfer from the screening stage to the cleaning stage. 

Table 1^ shows the actual operating conditions of the variable selection procedures, when 
a threshold on the test statistics has to be set to control the FDR. Here, the threshold is set 
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to control the FDR at a 5% level, using a Benjamini-Hochberg correction. This control is 
always effective for the screen and clean procedures, but not for variable selection based on 
univariate testing. In all designs, our proposal dramatically improves over the original OLS 
strategy, with sensitivity gains ranging from 50% to 100%. All differences in sensitivity 
are statistically significant. The variability of FDR and sensitivity is not shown to avoid 
clutter, but the smallest variability in FDR is obtained for the adaptive-ridge cleaning, 
while the smallest variability in sensitivity is obtained for univariate regression, followed 
by adaptive-ridge cleaning. The adaptive ridge penalty thus brings about more stability to 
the overall selection process. 


6 Discussion 


We propose to use the magnitude of regression coefficients in two-stage variable selection 


procedures. First, we use the connection between the Lasso and adaptive-ridge (Grand- 


valet 1998) to convey more information from the screening stage to the second stage: the 


magnitude of the coefficients estimated at the screening stage is transferred to the second 
stage through penalty parameters. 

Empirically, our procedure brings marginal improvements when the second stage aims 
at improving the regression coefficients ( Belloni and Chernozhukov|2013 , Liu and Yu||2013 ), 
and it provides sensible improvements compared to the original screen and clean procedure 
( Wasserman and Roeder||2009 ) when assessing the uncertainties pertaining to the selection 
of relevant variables. In the hrst setup, screening and estimation are performed on the 
same data set, whereas in the second one, the screening and cleaning stages operate on two 
distinct subsamples of data: the transfer is more valuable in this situation. 

Regarding post-Lasso estimation, our experiments demonstrate that two-stage methods 
can have consequential benehcial or detrimental effects in small sample regression. The 
results vary considerably according to the strategy governing the choice of the penalty 
parameters, but the joint optimization with respect to the two penalization parameters is 
the most effective one regarding predictive mean squared error. 

For screen and clean, we obtained a better control of the False Discovery Rate, which 
extends to more difficult settings, with high correlations between variables. Furthermore, 
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the sensitivity obtained by onr cleaning stage is always as good, and often much better 
than the one based on the ordinary least squares. The penalized second step also allows 
for a less severe screening, since the second stage can then handle more than n/2 variables. 
Our procedure can thus be employed in very high-dimensional settings, as the screening 


property (that is, in the words of Biihlmann (2013), the ability of the Lasso to select all 
relevant variables) is more easily fulhlled, which is essential for a reliable control of the 
false discovery rate. 

Several interesting directions are left for future works. The second stage can accom¬ 
modate arbitrary penalties, and our efficient implementation applies to all penalties for 
which a quadratic variational formulation can be derived. This is particularly appealing 
for structured penalties such as the fused-lasso or the group-Lasso, allowing to use the 
knowledge of groups at the second stage, through penalization coefficients. 

On the theoretical side, many interesting issues are raised. In particular, we would 
like to back-up the empirical improvements that have been almost systematically observed 
by an apposite analysis. We believe that two tracks are promising: hrst by exploiting 
that the screening stage transfers a quantihed response to the cleaning stage through the 
penalization coefficients, and second, that screening needs not to be stringent, due to the 
ability of our second stage to handle more variables. 


Software 

Software and simulations are in the form of R package named “ridgeAdap” available on 
the personal author page (https://www.hds.utc.fr/~becujean). 
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A Variational Equivalence 

We show below that the qnadratic penalty in (3 in Problem (|^ acts as the Lasso penalty 

A||/3|li. 

Proof. The Lagrangian of Problem ([^ is: 



Tims, the hrst order optimality conditions for tj are 



dTj ^ 


-A/3| -f z/Q - Uj = 0 
—A/3| J- z/q t*^ = 0 , 
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where the term in Uj vanishes dne to complementary slackness, which implies here VjT* = 0. 


Together with the constraints of Problem (|^, the last eqnation implies Tj 
Problem (|^ is eqnivalent to 


hence 


which is the original Lasso formnlation. 


□ 


B Efficient Implementation 

Permutation tests rely on the simulation of numerous data sampled under the null hypoth¬ 
esis distribution. The number of replications must be important to estimate the rather 
extreme quantiles we are typically interested in. Here, we use B = 1000 replications for the 
q = variables selected in the screening stage. Each replication involving the fitting of a 
model, the total computational cost for solving these B systems of size q on the q selected 
variables is 0{Bq{q^ + q^n)). In the situation where q B, great computing savings can 
be obtained using block-wise decompositions and inversions. 

First, we recall that the adaptive-ridge estimate, computed at the cleaning stage, is 
computed as 


/3= (xTx + A)“'xTy , 


where A is the diagonal adaptive-penalty matrix dehned at the screening stage, whose jth 
diagonal entry is A/r*, as defined in (jTj-[^. 


In the F-statistic Q, the permutation affects the calculation of the larger model yi. 


which is denoted y[^^ for the 6th permutation. Using a similar notation convention for the 
design matrix and the estimated parameters, we have y^^^ = X*'^^/3*' \ When testing the 
relevance of variable j, X^^^ is dehned as the concatenation of the permuted variable 
and the other original variables: X*^^) = xi,..., Xj_i, x^+i, ...Xp). Then, ^ can be 
efficiently computed by using G M, X\j G and 0\,j G dehned as follows: 
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Indeed, using the Schur complement, one writes /3 


(b) 


as follows: 



1 




Hence, /3 can be obtained as a correction of the vector of coefficients (3\j obtained under 
the smaller model. The key observation to be made here is that does not inter¬ 
vene in the expression -|- \\j) ^ , which is the bottleneck in the computation of 

a(^), and It can therefore be performed once for all permutations. Additionally, 
(X^X\j -|- A\j) ^ can be cheaply computed from (X^X + A) ^ as follows: 


(X-^X + A) 


(X"^X +A) 


-1 


-I 


-1 




X'^XT A) ^ [(X^X +A) ^ 
-I jj L 




Thus we compute (X^X -|- A) ^ once, hrstly correct for the removal of variable j, secondly 
correct for permutation b, thus eventually requiring 0{B{q^ -|- q^n))) operations. 
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SNR = 8 


SNR = 32 



Figure 1: Mean prediction error computed over 500 simulations for each design. Lasso 
direct estimation (L) is compared to: Lasso screening followed by OLS estimation (L+0), 
Lasso screening followed by ridge estimation (L+R), Lasso screening followed by adaptive- 
ridge estimation (L-l-A), jointly optimized Lasso screening with adaptive-ridge estimation 
(L&A). 
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Figure 2: Sensitivity SEN versus False Discovery Rate FDR (the higher, the better). Lasso 
screening followed by: adaptive-ridge cleaning (red solid line), ridge cleaning (green dashed 
line), OLS cleaning (blue dotted line); univariate testing (black dot-dashed line). All curves 
are indexed by the rank of the test statistics, and averaged over the 500 simulations of each 
design. The vertical dotted line marks the 5% FDR level. 
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